Abstract

Bag-of-words (BoW) methods are a popular class of object
recognition methods that use image features (e.g., SIFT) to form
visual dictionaries and subsequent histogram vectors to represent
object images in the recognition process. The accuracy of the BoW
classifiers, however, is often limited by the presence of
uninformative features extracted from the background or irrelevant
image segments. Most existing solutions to prune out uninformative
features rely on enforcing pairwise epipolar geometry via an
expensive structure-from-motion (SfM) procedure. Such solutions are
known to break down easily when the camera transformation is large
or when the features are extracted from low-resolution, low-quality
images. In this paper, we propose a novel method to select
informative object features using a more efficient algorithm called
Sparse PCA. First, we show that using a large-scale multiple-view
object database, informative features can be reliably identified
from a high-dimensional visual dictionary by applying Sparse PCA on
the histograms of each object category. Our experiment shows that
the new algorithm improves recognition accuracy compared to the
traditional BoW methods and SfM methods. Second, we present a new
solution to Sparse PCA as a semidefinite programming problem using
Augmented Lagrange Multiplier methods. The new solver outperforms
the state of the art for estimating sparse principal vectors as a
basis for a low-dimensional subspace model. The source code of our
algorithms will be made public on our website.

1.  Introduction 

In the past decade, the exponential growth of storage capacity has
encouraged people to upload personal images to large online image
databases such as Picasa and Flickr. The proliferation of modern
smartphones equipped with low-quality mobile cameras has also
spurred interest in endowing smartphone users with the ability to
automatically recognize common objects and landmark buildings in
man-made urban environments. The existence of common objects and
landmarks in these images has motivated research in visual object
recognition [7, 9, 12, 27]. Images in these coarsely labelled
databases are used to train classifiers that can be used to
recognize different object categories. To tackle the problem of
recognizing a large number of objects in large image databases, a
visual-dictionary based approach has been well studied [19, 21],
which has further led to several other methods to recognize objects
in both the single-view and multi-view settings [3, 4, 8, 17, 23,
25]. Essentially, most of the methods work with certain visual
descriptors (e.g., SIFT and its many variants) extracted from the
images to construct visual histograms, which represent the object
appearance in the images using a precomputed visual dictionary.

Although  the  visual-dictionary  methods  have  proven  to 
be  efficient  in  describing  object  images,  the  accuracy  of 
the  classifiers  is  often  limited  by  the  presence  of  uninfor¬ 
mative  image  features  typically  extracted  from  the  back¬ 
ground  or  irrelevant  image  segments,  such  as  pedestrians 
and  vegetation  (see  Figure  1  for  an  example).  When  the  ir¬ 
relevant  segments  take  on  a  significant  portion  of  an  image, 
the  uninformative  features  can  dominate  the  representation 
in the visual histogram, and hence lead to inferior recognition
accuracy. In [24], Turcot and Lowe suggested that if a subset of
so-called useful features or informative features can be
systematically selected during the training stage, it not only
reduces the number of visual descriptors needed, but also
significantly improves the recognition accuracy. Since in man-made
environments, most objects of
interest,  in  particular  landmark  buildings,  are  rigid  objects, 
3-D  perspective  geometry  can  be  leveraged  to  select  infor¬ 
mative  features  that  satisfy  a  pairwise  epipolar  constraint 
via  RANSAC.  This  is  known  as  the  Structure-from-Motion 
(SfM)  approach. 
Motivated by the literature, in this paper, we study how to improve
informative feature selection in both speed and accuracy from
possibly low-resolution, low-quality camera networks. One major
problem in enforcing the epipolar constraint on images collected
from low-power camera networks instead of high-end photography is
that establishing wide-baseline feature correspondence of SIFT-type
features is known to be brittle even using state-of-the-art bundle
adjustment techniques [22]. In addition, the quality of images
sampled from low-power camera sensors also presents a challenge to
reliably extract image features to describe the appearance of
interesting objects in multiple views.

We propose to address this problem by a principled semidefinite
programming (SDP) technique, known as Sparse Principal Components
Analysis (Sparse PCA) [30]. As an extension of the popular PCA
method, Sparse PCA addresses a drawback of classical PCA: the
principal vectors (PVs) that form a basis of a low-dimensional
subspace typically have dense non-zero entries. In particular, in
high-dimensional settings, the dense linear combinations of all the
variables make it difficult to interpret the corresponding
principal components (PCs).

In  case  of  visual-dictionary  based  object  recognition,  the 
variables  in  a  high-dimensional  histogram  are  associated 
with  the  codewords  that  represent  either  informative  fore¬ 
ground  features  or  uninformative  background.  We  contend 
that  in  a  large-scale  object  image  database,  the  subset  of 
informative  features  can  be  reliably  selected  by  the  sparse 
coefficients  in  the  first  few  PVs.  The  new  solution  is  more 
robust  to  wide-baseline  camera  transformation  and  numeri¬ 
cally  more  efficient  than  the  existing  solutions  of  establish¬ 
ing  pairwise  rigid-body  correspondence. 

1.1.  Main  Contributions 

In  this  paper,  we  exploit  the  use  of  Sparse  PCA  as  a 
variable  selection  tool  for  selecting  informative  features 
in  the  object  images  captured  from  low-resolution  cam¬ 
era  sensor  networks.  Firstly,  we  present  a  scheme  for  us¬ 
ing  Sparse  PCA  with  high-dimensional  covariance  matri¬ 
ces  constructed  from  visual  histograms  to  extract  a  sparse 
support  of  visual  codewords  for  each  object  category.  We 
compare  its  performance  with  the  SfM  technique  applied 
to  large-baseline,  low-quality  multiple-view  images.  Sec¬ 
ondly,  we  propose  a  state-of-the-art  algorithm  to  speed 
up  Sparse  PCA  using  the  Augmented  Lagrange  Multiplier 
(ALM)  approach  [2,2  ].  To  mitigate  the  high  dimension¬ 
ality  of  the  visual  dictionary,  a  direct  variable  elimination 
method  called  SAFE  is  presented  to  further  prune  out  unin¬ 
formative  features  for  object  recognition  prior  to  the  Sparse 
PCA  process.  The  experiment  on  synthetic  data  shows  that 
the  new  algorithm  outperforms  the  previous  convex  pro¬ 
gramming  algorithm  (DSPCA)  [5]  in  terms  of  speed  while 
maintaining the same estimation accuracy. Finally, we perform
object recognition experiments, which demonstrate improved
recognition by successfully suppressing uninformative features. To
aid peer evaluation, the source code of our algorithms will be made
public on our website.


2.  Recognition  via  Vocabulary  Trees 

In  object  recognition,  certain  local  invariant  features 
have  become  a  popular  representation  of  the  object  images, 
which  can  be  extracted  and  encoded  into  high-dimensional 
descriptors  using  algorithms  such  as  SIFT  [15]  and  SURF 
[]]. In the bag-of-words (BoW) approach, these invariant
features  are  further  quantized  to  form  a  dictionary  of  visual 
words.  All  the  feature  descriptors  in  the  training  set  are  hi¬ 
erarchically  clustered  into  visual  word  clusters  ( e.g .,  using 
hierarchical  k-means  [13]).  This  hierarchical  tree  is  com¬ 
monly  referred  to  as  a  vocabulary  tree  [19].  The  size  of  a 
vocabulary  tree  for  a  large  database  ranges  from  thousands 
to  hundreds  of  thousands.  For  example,  in  this  paper,  we 
use hierarchical k-means to construct 1,000-D vocabularies
for  our  training  image  database,  with  a  branch  factor  of  k  = 
10  and  four  hierarchies. 

To  start  the  training  process,  feature  descriptors  in  each 
training  image  are  propagated  down  the  vocabulary  tree  to 
form a BoW model for the image. Then a term-frequency
inverse-document-frequency (tf-idf) weighted visual histogram $y$
is defined for each training image [19]. For each object category
$i = 1, \ldots, C$, $m$ weighted histograms are generated from the
$m$ training images of that category respectively:
$A_i = \{y_1, y_2, \ldots, y_m\}$. All the $C$ sets form the
training set, $A = \{A_1, A_2, \ldots, A_C\}$.

During the testing phase, feature descriptors are extracted for the
query image and propagated down the vocabulary tree in the same
fashion to obtain a single weighted query histogram $q$. Using the
simplest nearest-neighbor classifier,$^1$ the query image is then
given a relevance score $s$ based on the $\ell_1$-normalized
difference between the weighted query and the $i$th training set
$A_i$.

Finally, the label of the visual histogram $q$ is assigned as the
object category that achieves the minimal relevance score:

$$\text{label}(q) = \arg\min_{i} \, s(q, A_i). \quad (2)$$
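
The decision rule (2) can be sketched in a few lines of NumPy. The exact form of the relevance score is not reproduced in this text, so the sketch assumes one plausible reading: the score of a category is the smallest $\ell_1$ distance between the $\ell_1$-normalized query and any $\ell_1$-normalized training histogram of that category. The function names and the toy histograms are illustrative only.

```python
import numpy as np

def l1_normalize(v):
    """Scale a histogram so that its absolute entries sum to one."""
    total = np.abs(v).sum()
    return v / total if total > 0 else v

def relevance_score(q, A_i):
    """Assumed score: smallest l1 distance between the normalized query
    and the normalized training histograms (the columns of A_i)."""
    qn = l1_normalize(q)
    return min(np.abs(qn - l1_normalize(y)).sum() for y in A_i.T)

def classify(q, training_sets):
    """Rule (2): return the index of the category with minimal score."""
    scores = [relevance_score(q, A_i) for A_i in training_sets]
    return int(np.argmin(scores))

# Toy data: two categories, a 4-word vocabulary, two images per category.
A1 = np.array([[5.0, 4.0], [1.0, 2.0], [0.0, 0.0], [0.0, 1.0]])
A2 = np.array([[0.0, 1.0], [0.0, 0.0], [6.0, 3.0], [2.0, 4.0]])
query = np.array([4.0, 1.0, 0.0, 0.0])
predicted = classify(query, [A1, A2])
```

Here the query shares its dominant visual words with the first category, so `predicted` is index 0.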


2.1.  Failure  of  SfM  on  low-quality  images 

It was suggested by Turcot and Lowe [24] that the accuracy of
object recognition in large image databases can be improved by
suppressing uninformative visual words that typically represent
irrelevant image background. In [24], SfM techniques were used to
enforce pairwise epipolar constraints of rigid objects. The authors
argued that, between a pair of images that render the same object
in space, uninformative features can be easily pruned out as
outliers w.r.t. a dominant epipolar constraint by RANSAC. Along
similar lines, Philbin et al. [20] introduced a Geometric Latent
Dirichlet Allocation model for constructing image adjacency graphs.
Subsequently, rich latent topic models were built from the
adjacency graphs with the identity and locations of visual words
specific to the objects, thereby rejecting uninformative visual
words. Knopp et al. [\ ] augmented query images with rough
geolocation information combined with wide-baseline feature
matching to detect and suppress uninformative features before
invoking vocabulary tree based object recognition.

$^1$In the literature, more sophisticated classifiers such as SVMs
have also been used. Nevertheless, this is not the focus of the
paper.

(a) Original SURF feature detection results.
(b) Informative features detected by SfM.
(c) Informative features selected by thresholded PCA based on the
first two leading PVs.
(d) Informative features selected by Sparse PCA based on the first
two leading PVs.

Figure 1. Comparison of informative feature selection on
low-quality multiple-view images. A subset of 16 training images of
a building (Campanile at UC Berkeley) in the BMW database [17] are
used for training. For each image pair in SfM, SURF features are
deemed informative if the consensus of the corresponding epipolar
constraint exceeds 25% of the total feature pairs. For thresholded
PCA, we manually assign small-valued entries to zero in PVs in an
attempt to achieve the same sparsity as Sparse PCA. The best
results to identify informative features on the Campanile are given
by Sparse PCA.

All these methods rely on the accuracy of wide-baseline feature
matching to establish pairwise epipolar geometry.
However,  they  tend  to  fail  when  the  quality  of  the  images 
in  the  database  is  very  poor,  as  is  the  case  with  images 
captured  from  mobile  cellphones  or  distributed  camera  net¬ 
works.  Furthermore,  man-made  landmarks  such  as  build¬ 
ings  often  have  repetitive  texture  and  patterns  that  tend  to 
confuse  feature  correspondence  algorithms  ( e.g .,  Bundler 
[22]).  Figure  1  (b)  shows  an  example  where  SfM  fails  at  de¬ 
termining  the  wide-baseline  transformation  across  images 
of  an  object  captured  from  multiple  vantage  points.  More 
examples  can  be  found  in  Figure  4  later. 


3.  Identifying  Informative  Features 

Classical PCA is a well established tool for the analysis of
high-dimensional data. For a data matrix $A$, PCA computes the PCs
via an eigenvalue decomposition of its empirical covariance matrix
$\Sigma_A$. It has also been observed that in general the entries
of the corresponding PVs are dense and nonzero. In certain
applications, it is desirable to obtain PVs that can explain
maximum variability in the data $A$ using linear combinations of
just a few nonzero variables, and hence improve the
interpretability of such data. It is with this motivation that
Sparse PCA was developed [5, 30], and it has proven to be a very
useful tool for identifying focalized hidden information in data
where the coordinate axes involved have physical interpretations.

In  the  BoW  approach  to  object  recognition,  each  coor¬ 
dinate  axis  in  the  visual  histogram  corresponds  to  a  partic¬ 
ular  visual  word  in  the  vocabulary  tree.  We  contend  that 
the  visual  words  that  explain  maximum  variability  in  data 
corresponding  to  each  object  category  can  be  regarded  as 
informative  features  for  object  recognition.  In  order  to  use 
Sparse  PCA  to  identify  these  visual  words,  an  empirical  co- 
variance  matrix  must  first  be  computed  for  each  object  cat¬ 
egory  in  the  database. 

Let us consider $m$ available training images of an object
category. Using the constructed vocabulary tree learned from all
the categories, the SURF descriptors in each image are converted
into a visual histogram $y \in \mathbb{R}^n$. The $m$ vectors
$\{y_j\}$ are then normalized to have unit length and centered, and
grouped into a data matrix:
$A = [y_1, y_2, \ldots, y_m] \in \mathbb{R}^{n \times m}$. The
empirical covariance matrix is then computed from this data matrix
as $\Sigma_A = \frac{1}{m} A A^T$.

Sparse PCA that computes the first sparse eigenvector of $\Sigma_A$
optimizes the following objective [30]:

$$x_s = \arg\max_{x} \; x^T \Sigma_A x \quad \text{subj. to } \|x\|_2 = 1, \; \|x\|_1 \le k. \quad (3)$$

We denote the indices of the non-zero coefficients in $x_s$ by
$\mathcal{I}$ (i.e., the nonzero support of $x_s$). These indices
correspond to the visual words that explain maximum variability in
$A$, and are subsequently used in the object recognition process
(explained in Section 6).
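
As a rough illustration of the preprocessing described above, the following NumPy sketch builds a per-category covariance matrix from a set of histograms and reads off the nonzero support of a given vector. It assumes the usual $1/m$ scaling for the empirical covariance; the helper names, the tolerance `tol`, and the toy histogram matrix `H` are hypothetical.

```python
import numpy as np

def category_covariance(histograms):
    """histograms: (n_words, m_images) array, one histogram per column.
    Columns are scaled to unit length, then centered across images,
    and the n x n matrix (1/m) A A^T is returned."""
    Y = np.asarray(histograms, dtype=float)
    norms = np.linalg.norm(Y, axis=0)
    norms[norms == 0] = 1.0                    # leave empty histograms as zeros
    Y = Y / norms
    A = Y - Y.mean(axis=1, keepdims=True)      # remove the mean histogram
    m = A.shape[1]
    return A @ A.T / m

def support(x, tol=1e-8):
    """Indices of the entries of x that are nonzero up to a tolerance."""
    return set(np.flatnonzero(np.abs(x) > tol).tolist())

# Toy example: 4 visual words, 3 training images.
H = np.array([[3.0, 1.0, 2.0],
              [0.0, 4.0, 1.0],
              [1.0, 0.0, 5.0],
              [2.0, 2.0, 0.0]])
Sigma = category_covariance(H)
```

The resulting `Sigma` is symmetric and positive semidefinite, which is what a Sparse PCA solver expects as input.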

In practice, it is common that the leading sparse PV alone may not
be sufficient for obtaining a variable support, and it is desirable
to further estimate a few subsequent sparse PVs as well. In
optimization, it is a common practice to estimate succeeding
eigenvectors by sequentially deflating the covariance matrix with
the preceding ones. Several techniques have been explored for
reliably deflating a covariance matrix for Sparse PCA [16]. We
adopt a simple technique called Hotelling's deflation that
eliminates the influence of the first sparse PV to obtain a
deflated covariance matrix $\Sigma_A'$ as follows:

$$\Sigma_A' = \Sigma_A - (x_s^T \Sigma_A x_s) \, x_s x_s^T. \quad (4)$$

Then, the second sparse eigenvector $x_s'$ of $\Sigma_A$ becomes
the leading sparse eigenvector of $\Sigma_A'$, and can be estimated
again by Sparse PCA (3). In our experiment, we observe that the
first two sparse PVs are sufficient for selecting informative
features that lie on the foreground objects in the BMW database (as
shown in Figures 1 and 4). Finally, if we denote the indices of the
non-zeros in the second PV $x_s'$ as $\mathcal{I}'$, then the union
$\mathcal{I} \cup \mathcal{I}'$ provides the support corresponding
to the informative features of a particular category.
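
Hotelling's deflation is a one-line update. The sketch below is a minimal NumPy version, assuming the vector passed in is (or is rescaled to) unit length; the function name and the diagonal toy matrix are illustrative.

```python
import numpy as np

def hotelling_deflate(Sigma, x):
    """Remove the variance explained by direction x from Sigma:
    Sigma' = Sigma - (x^T Sigma x) x x^T, with x scaled to unit norm."""
    x = x / np.linalg.norm(x)
    return Sigma - (x @ Sigma @ x) * np.outer(x, x)

# Toy example: the leading eigenvector of a diagonal matrix is e_0.
Sigma = np.diag([3.0, 2.0, 1.0])
x1 = np.array([1.0, 0.0, 0.0])
Sigma_deflated = hotelling_deflate(Sigma, x1)
```

After deflation the direction `x1` carries zero variance, so the leading eigenvalue of `Sigma_deflated` is the second eigenvalue of the original matrix.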

For pedagogical purposes, we also compare the variable selection
performance of thresholded PCA in Figures 1 and 4. To obtain a
sparsified PCA support set, we perform PCA on the same covariance
matrix $\Sigma_A$ and pick the top $k$ indices of the corresponding
first and second PVs with the highest absolute values as the
informative features. Here, $k$ is chosen as the cardinality of the
corresponding sparse PVs for the same category. The examples
clearly show that the majority of the features selected this way do
not represent the foreground objects.
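
The thresholded PCA baseline can be sketched as follows. This is a simplified illustration: it uses a single cardinality `k` for both PVs, whereas the text matches `k` to each sparse PV separately, and the function name and toy matrix are assumptions.

```python
import numpy as np

def thresholded_pca_support(Sigma, k, num_pvs=2):
    """Union of the k largest-magnitude coordinates of each of the
    leading num_pvs ordinary (dense) principal vectors of Sigma."""
    eigvals, eigvecs = np.linalg.eigh(Sigma)       # eigenvalues ascending
    leading = np.argsort(eigvals)[::-1][:num_pvs]  # largest eigenvalues first
    chosen = set()
    for j in leading:
        v = eigvecs[:, j]
        top = np.argsort(np.abs(v))[::-1][:k]      # k largest |entries|
        chosen.update(top.tolist())
    return chosen

# Toy example: for a diagonal covariance the PVs are coordinate axes.
Sigma = np.diag([5.0, 4.0, 1.0, 0.5])
selected = thresholded_pca_support(Sigma, k=1)
```

With this toy input the two leading PVs are the first two coordinate axes, so the selected support is `{0, 1}`.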

4.  Speeding  up  Sparse  PCA  using  ALM 

Sparse PCA has been an active research topic for over a decade.
Notable approaches include SCoTLASS [10], SLRA [29], and SPCA [30],
all of which aim at finding modified PVs with sparse entries.
However, one drawback of all the above algorithms is that the
formulation requires solving nonconvex objective functions.
Recently, d'Aspremont et al. [5] derived an $\ell_1$-norm based
semidefinite relaxation for Sparse PCA called DSPCA, and it is
currently the most widely known convex formulation of the problem.
This algorithm, however, has a slow convergence rate that is a
major bottleneck when analyzing high-dimensional data. Augmented
Lagrange multiplier (ALM) based algorithms have recently gained a
lot of popularity due to their rapid convergence and speed in
$\ell_1$-minimization [26] and Robust PCA [14] problems. These have
motivated us to develop a new algorithm for solving the
semidefinite relaxation form of Sparse PCA using ALM.

We begin by showing that Sparse PCA can be converted to an SDP [5].
Given an empirical covariance matrix $\Sigma \in \mathbb{S}^n$,
with $n$ representing the dimensionality of the data, Sparse PCA
solves the following objective:

$$\max_{\|x\|_2 \le 1} \; x^T \Sigma x - \rho \|x\|_0, \quad (5)$$

where $\rho > 0$ is a scalar parameter controlling the sparsity in
$x$. By following the $\ell_1$-norm relaxation and lifting
procedure for semidefinite relaxation, and dropping a nonconvex
rank constraint, we can rewrite (5) as [5]:

$$\max \; \mathrm{Tr}(\Sigma X) - \rho \|X\|_1 \; : \; \mathrm{Tr}(X) = 1, \; X \succeq 0,^2 \quad (6)$$

where $X = x x^T$ is a matrix variable. Duality allows us to
rewrite this problem as an SDP:

$$\min \; \lambda_{\max}(\Sigma + U) \; : \; -\rho \le U_{ij} \le \rho. \quad (7)$$

$^2$In this paper, $\|X\|_1$ represents the entrywise norm:
$\mathbf{1}^T |X| \mathbf{1}$.

As presented in [5], assuming $\Sigma$ is fixed and given, the
maximum eigenvalue function $\lambda_{\max}(\cdot)$ can be
approximated by a smooth, uniform objective (i.e., with Lipschitz
continuous gradient):

$$f_\mu(U) = \mu \log\big(\mathrm{Tr} \exp((\Sigma + U)/\mu)\big) - \mu \log(n), \quad (8)$$
$$\nabla f_\mu(U) = \exp((\Sigma + U)/\mu) \, / \, \mathrm{Tr}\big(\exp((\Sigma + U)/\mu)\big), \quad (9)$$

where $\mu = \epsilon / (2 \log(n))$ produces an
$\epsilon$-approximate solution. With this approximation, (7) can
be rewritten as:

$$\min \; f_\mu(U) \; : \; -\rho \le U_{ij} \le \rho. \quad (10)$$
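
The smooth surrogate (8) and its gradient (9) can be evaluated through one symmetric eigendecomposition. The NumPy sketch below is an illustration rather than the authors' MATLAB code: it assumes $\Sigma + U$ is symmetric and uses a log-sum-exp shift for numerical stability; the function name and the random test matrix are hypothetical.

```python
import numpy as np

def smooth_max_eig(Sigma, U, mu):
    """Return f_mu(U) as in (8) and its gradient as in (9).
    Sigma + U is assumed symmetric, so eigh applies."""
    n = Sigma.shape[0]
    lam, V = np.linalg.eigh(Sigma + U)
    lam_max = lam.max()
    w = np.exp((lam - lam_max) / mu)           # shifted to avoid overflow
    f = lam_max + mu * np.log(w.sum()) - mu * np.log(n)
    weights = w / w.sum()
    grad = (V * weights) @ V.T                 # V diag(weights) V^T
    return f, grad

# Example with accuracy eps = 0.1 on a random 5 x 5 covariance.
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
Sigma = B @ B.T / 5
eps = 0.1
mu = eps / (2 * np.log(5))
f_val, grad = smooth_max_eig(Sigma, np.zeros((5, 5)), mu)
lam_top = np.linalg.eigvalsh(Sigma).max()
```

Because the sum of the shifted exponentials lies between 1 and $n$, `f_val` lies within $\mu \log(n) = \epsilon/2$ below `lam_top`, and the gradient is a positive semidefinite matrix with unit trace, matching the feasible set of (6).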

Based on the above SDP formulation, next we consider speeding up
Sparse PCA via an ALM approach [. ]. The basic idea is to eliminate
the constraints and add to the cost function a penalty term that
prescribes a high cost to infeasible points. This augmented cost
function is called the augmented Lagrangian function. In our case,
the box constrained convex problem of (10) can be written in an
unconstrained form as:

$$F(U, Y) = \min \Big\{ f_\mu(U) + \sum_{1 \le i,j \le n} P(U_{ij}, Y_{ij}, c) \Big\}, \quad (11)$$

where $Y_{ij}$, $1 \le i, j \le n$, represents the Lagrange
variable, $c$ determines the severity of the penalty, and

$$P(u, y, c) = \begin{cases}
y(u - \rho) + \frac{c}{2}(u - \rho)^2 & \text{if } \rho - \frac{y}{c} \le u, \\
y(u + \rho) + \frac{c}{2}(u + \rho)^2 & \text{if } -\rho - \frac{y}{c} \ge u, \\
-\frac{y^2}{2c} & \text{otherwise.}
\end{cases} \quad (12)$$
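
The penalty in (12) acts elementwise, so it vectorizes directly. The sketch below assumes the piecewise form with thresholds $\rho - y/c$ and $-\rho - y/c$ and the constant branch $-y^2/(2c)$; the function name `box_penalty` is illustrative.

```python
import numpy as np

def box_penalty(u, y, c, rho):
    """Elementwise augmented-Lagrangian penalty for -rho <= u <= rho.
    u and y may be scalars or arrays of the same shape; c, rho > 0."""
    u = np.asarray(u, dtype=float)
    y = np.asarray(y, dtype=float)
    upper = y * (u - rho) + 0.5 * c * (u - rho) ** 2   # upper bound active
    lower = y * (u + rho) + 0.5 * c * (u + rho) ** 2   # lower bound active
    inactive = -(y ** 2) / (2.0 * c)                   # neither bound active
    return np.where(u >= rho - y / c, upper,
                    np.where(u <= -rho - y / c, lower, inactive))

# With zero multipliers the penalty vanishes inside the box and grows
# quadratically outside it.
inside = float(box_penalty(0.0, 0.0, 2.0, 0.1))
outside = float(box_penalty(1.1, 0.0, 2.0, 0.1))
```

The three branches agree at the switching points, so the penalty is continuous in `u`; for example, with `y = 0.5`, `c = 2` and `rho = 0.1`, the upper branch evaluated at `u = rho - y/c` equals `-y**2 / (2*c)`.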
The algorithm for Sparse PCA using ALM (SPCA-ALM) is presented in
Algorithm 1. Note that in each iteration of the outer loop of the
algorithm, we need to solve the unconstrained minimization problem
in (11), which has no closed-form solution. Thus, we employ
Nesterov's first-order gradient technique [18]. Once this augmented
Lagrangian function is minimized, the Lagrange multipliers $Y$ are
updated using the rule:

$$Y_{ij}^{k+1} = \begin{cases}
Y_{ij}^k + c^k (U_{ij}^k - \rho) & \text{if } Y_{ij}^k + c^k (U_{ij}^k - \rho) > 0, \\
Y_{ij}^k + c^k (U_{ij}^k + \rho) & \text{if } Y_{ij}^k + c^k (U_{ij}^k + \rho) < 0, \\
0 & \text{otherwise.}
\end{cases} \quad (13)$$

After the algorithm converges, the primal variable is given by the
gradient in (9), i.e., $X^k = \nabla f_\mu(U^k)$. Then the sparse
principal component is recovered as the leading eigenvector of
$X^k$.
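
A minimal NumPy sketch of the multiplier update (13) and of the final recovery step is given below. It assumes the update is applied entrywise with a scalar penalty weight `c`; the function names are illustrative and the inner Nesterov loop is not shown.

```python
import numpy as np

def update_multipliers(Y, U, c, rho):
    """Entrywise update (13) for the box constraint -rho <= U_ij <= rho."""
    up = Y + c * (U - rho)      # candidate when the upper bound is violated
    down = Y + c * (U + rho)    # candidate when the lower bound is violated
    return np.where(up > 0, up, np.where(down < 0, down, 0.0))

def leading_eigenvector(X):
    """Leading eigenvector of a symmetric matrix X."""
    eigvals, eigvecs = np.linalg.eigh(X)   # eigenvalues in ascending order
    return eigvecs[:, -1]

# Entries outside [-rho, rho] receive nonzero multipliers.
U = np.array([[0.3, -0.3], [0.0, 0.05]])
Y_next = update_multipliers(np.zeros((2, 2)), U, c=1.0, rho=0.1)
```

In this example only the two entries of `U` that leave the box `[-0.1, 0.1]` obtain nonzero multipliers, with signs indicating which bound is violated.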


4.1.  Performance 

Algorithm 1: SPCA-ALM

Input: Covariance $\Sigma$ and $\rho > 0$.
$U^1 \leftarrow 0$, $Y^1 \leftarrow 0$, $X^1 \leftarrow 0$, $c^1 \leftarrow 1$.
while not converged ($k = 1, 2, 3, \ldots$) do
    $t^1 \leftarrow 1$, $V^1 \leftarrow U^k$, $W^0 \leftarrow U^k$, $Z \leftarrow \mathrm{rand}(n, n)$.
    $\alpha^0 \leftarrow \|V^1 - Z\|_F \, / \, \|\nabla F(V^1, Y^k) - \nabla F(Z, Y^k)\|_F$.
    while not converged ($l = 1, 2, 3, \ldots$) do
        Find smallest $i \ge 0$ for which
            $F(V^l, Y^k) - F\big(V^l - \frac{\alpha^{l-1}}{2^i} \nabla F(V^l, Y^k), Y^k\big) \ge \frac{\alpha^{l-1}}{2^{i+1}} \|\nabla F(V^l, Y^k)\|_F^2$.
        $\alpha^l \leftarrow 2^{-i} \alpha^{l-1}$, $W^l \leftarrow V^l - \alpha^l \nabla F(V^l, Y^k)$.
        $t^{l+1} \leftarrow \big(1 + \sqrt{4 (t^l)^2 + 1}\big) / 2$.
        $V^{l+1} \leftarrow W^l + \frac{t^l - 1}{t^{l+1}} (W^l - W^{l-1})$.
    end while
    $U^{k+1} \leftarrow W^l$.
    Update $Y^{k+1}$ using the update rule (13).
    $X^{k+1} \leftarrow \nabla f_\mu(U^{k+1})$.
    $c^{k+1} \leftarrow 2^k$.
end while
Output: Sparse principal vector $x_s \leftarrow$ leading eigenvector of $X^k$.

We have evaluated our SPCA-ALM algorithm by comparing its
performance against the DSPCA solver [5]. Both algorithms have been
implemented in MATLAB and benchmarked on a 2.6 GHz Intel processor
with 4 GB memory. We generate synthetic data of varying
dimensionality as follows. First, in the $n$-dimensional vector
space, 10% of the indices are selected as nonzero support. Next,
the values of the nonzero coefficients are drawn from an
independent and identically distributed Gaussian
$x_0(i) \sim \mathcal{N}(0, 200)$. Finally, random noise
$\epsilon \sim \mathcal{N}(0, 1)$ is added to $x_0$ to form a noisy
version of the empirical covariance matrix,
$\Sigma = (x_0 + \epsilon \mathbf{1})(x_0 + \epsilon \mathbf{1})^T$.
This covariance matrix, along with an optimal choice of the
parameter $\rho$ to encourage sparsity, is provided to both the
SPCA-ALM and DSPCA algorithms. The process repeats 10 times for
each problem dimension $n$, while $n$ varies from 100 to 500, and
the average speed and precision are computed for each $n$. Figure
2(a) compares the speed of the two algorithms, while Figure 2(b)
compares the estimation error of the first estimated sparse
principal vector. The simulation shows that SPCA-ALM converges much
faster than DSPCA (for example, at $n = 500$, SPCA-ALM is about 10
times faster), while maintaining approximately the same
reconstruction accuracy.

Figure 2. A comparison of SPCA-ALM and DSPCA using simulated data.
(a) Speed vs Data Dimension. (b) Estimation Error vs Data
Dimension.

5. Variable Elimination via SAFE

In this section, we further examine a dimensionality reduction
technique as a preprocessing step to speed up Sparse PCA.
Particularly in object recognition, the covariance matrix $\Sigma$
often can be of high dimension (e.g., 1000 and higher). Directly
calling SPCA-ALM may still be very time consuming. To mitigate this
problem, we invoke a feature elimination method presented in [6,
28], called SAFE. The method allows one to quickly eliminate
variables in problems involving a convex loss function and an
$\ell_1$-norm penalty, thereby leading to a substantial reduction
in the number of variables prior to running optimization. The
following theorem [6, 28] states the SAFE method applied to Sparse
PCA. An illustration of this process is shown in Figure 3.

Figure 3. SAFE feature elimination process. Top: The red rows and
columns of a sample covariance matrix $\Sigma$ are eliminated to
form a new, reduced covariance matrix, as the corresponding
variances are less than the chosen $\rho = 0.1$. Bottom: The
entries of the corresponding indices are subsequently zeroed out in
$x_s$.

Theorem 1 (SAFE Variable Elimination for Sparse PCA).
Given a covariance matrix $\Sigma$, denote $\sigma_k$ as its $k$th
diagonal entry. For the Sparse PCA problem (5), if
$\rho > \sigma_k$, then the $k$th element of the solution $x_s$
will not be in the sparse support. Hence, the $k$th row and column
of $\Sigma$ can be removed from the optimization.

Therefore, for a predefined choice of $\rho$, we first obtain a
reduced covariance matrix by eliminating all the rows and columns
corresponding to those variables with sample variance less than
$\rho$. The number of variables thus eliminated is a conservative
lower bound on the total number of zero-weight variables in the
final solution of Sparse PCA. In our experiments, we typically can
eliminate about 90% of the variables using SAFE without sacrificing
the accuracy of preserving important informative features.
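
The elimination step amounts to an index filter on the diagonal of the covariance matrix. The following NumPy sketch shows one way to apply it and to map a solution of the reduced problem back to the full vocabulary; the function names and the toy matrix are assumptions, not part of the SAFE references.

```python
import numpy as np

def safe_eliminate(Sigma, rho):
    """Drop every variable whose variance (diagonal entry) is below rho.
    Returns the reduced covariance and the indices that were kept."""
    keep = np.flatnonzero(np.diag(Sigma) >= rho)
    return Sigma[np.ix_(keep, keep)], keep

def expand_solution(x_reduced, keep, n):
    """Embed a solution of the reduced problem into the full
    n-dimensional space, with zeros at the eliminated indices."""
    x = np.zeros(n)
    x[keep] = x_reduced
    return x

# Toy example: with rho = 0.1, variables 1 and 3 are eliminated.
Sigma = np.diag([0.5, 0.05, 0.2, 0.01])
Sigma_reduced, keep = safe_eliminate(Sigma, rho=0.1)
x_full = expand_solution(np.array([1.0, 2.0]), keep, n=4)
```

Any Sparse PCA solver can then be run on `Sigma_reduced` alone, which is where the saving comes from when most diagonal entries fall below `rho`.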

6.  Experiment 

In  order  to  test  the  effectiveness  of  suppressing  uninfor¬ 
mative  features  for  the  task  of  object  recognition,  we  have 
evaluated  the  performance  of  our  method  on  the  Berkeley 
Multiview  Wireless  (BMW)  database  [17].  The  database 
consists of multiple-view images of 20 landmark buildings
on  the  Berkeley  campus.  For  each  building,  wide-baseline 
images  were  captured  from  16  different  vantage  points.  Fur¬ 
ther,  at  each  vantage  point,  5  short-baseline  images  were 
taken  (by  five  camera  sensors  #0  -  #4  simultaneously), 
thereby  summing  to  80  images  per  category.  All  images 
are  640  x  480  RGB  color  images.  It  is  important  to  note 
that  the  image  quality  in  this  database  is  considerably  lower 
than  many  existing  high-resolution  databases,  which  is  in¬ 
tended  to  reproduce  realistic  imaging  conditions  for  mobile 
camera  and  surveillance  applications.  Further,  it  is  notice¬ 
able  that  some  images  are  slightly  out  of  focus  and  in  some 
cases,  even  corrupted  by  dust  residual  on  the  camera  lenses. 

We  divide  the  database  into  a  training  set  and  a  testing 
set.  The  vantage  points  of  each  object  are  named  numeri¬ 
cally  from  0  to  15.  All  these  16  images  of  each  category 
captured  from  camera  #2  are  designated  as  the  training  set, 
and  the  remaining  images  are  assigned  to  the  testing  set. 
Thus,  there  are  16  training  images  and  64  testing  images 
for  each  category.  We  extract  SURF  features  in  each  of  the 
training  images  and  construct  a  vocabulary  tree  with  1000 
leaf  nodes. 

6.1.  Results 

We  first  evaluate  the  recognition  accuracy  of  the  classi¬ 
fier  (2)  without  suppressing  any  features  from  the  training 
and  testing  sets  to  obtain  a  baseline  performance.  The  re¬ 
sults  of  this  experiment  are  presented  in  Table  1.  For  the 
20  object  categories  tested,  the  average  baseline  recognition 
rate  is  around  80%. 


Next, for each object category $i$, we obtain its informative
feature set $\mathcal{I}_i$ by determining the indices of the
non-zero variables in the first and second sparse PVs. These are
estimated by running Sparse PCA on the covariance matrix
corresponding to the training histogram vectors in the $i$th
category. We then form the total support set $\mathcal{I}_{SPCA}$
for the entire database by taking the union of the individual
visual support sets for all the 20 object categories, i.e.,

$$\mathcal{I}_{SPCA} = \mathcal{I}_1 \cup \mathcal{I}_2 \cup \cdots \cup \mathcal{I}_{20}.$$

In our experiments, we have set the sparsity controlling parameter
$\rho$ to 0.002 for all the categories. With this choice of $\rho$,
at roughly 33 variables per category, our total support set
$\mathcal{I}_{SPCA}$ identifies 405 informative features (some
informative features overlap between classes), thereby rejecting a
fraction of about 3/5 of the visual words from the 1000-D
vocabulary. With this subset of visual words, we evaluate the
recognition accuracy of (2) again. The results are also presented
in Table 1. As one can see, for most of the categories, there is a
significant improvement in the recognition accuracy, leading to an
average recognition rate of about 85%, roughly 5 percentage points
higher than the baseline.
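
Forming the total support set and restricting histograms to it is a small bookkeeping step, sketched below in NumPy. The per-category supports in the example are made up for illustration; only the counts 405 and 1000 in the last line come from the text.

```python
import numpy as np

def total_support(per_category_supports):
    """Sorted union of the per-category index sets."""
    return sorted(set().union(*per_category_supports))

def restrict(histogram, support_idx):
    """Keep only the histogram bins of the selected visual words."""
    return np.asarray(histogram)[support_idx]

# Hypothetical supports for three categories over a 6-word vocabulary.
supports = [{1, 3}, {3, 4}, {0}]
I_spca = total_support(supports)
reduced = restrict(np.array([9.0, 8.0, 7.0, 6.0, 5.0, 4.0]), I_spca)

# Fraction of the 1000-word vocabulary rejected when 405 words are kept.
rejected_fraction = 1 - 405 / 1000
```

Overlapping indices are counted once in the union, which is why 20 categories at roughly 33 words each yield 405 words rather than about 660; the rejected fraction works out to 0.595.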

For  completeness,  Table  1  also  shows  the  number  of  se¬ 
lected  features  and  the  recognition  rates  for  the  SfM  ap¬ 
proach.  For  a  large  number  of  the  object  categories,  the  SfM 
method  does  not  seem  to  work  well,  as  few  of  the  SURF  fea¬ 
tures  are  correctly  selected  as  foreground  features.  We  have 
tested  the  recognition  accuracy  of  these  visual  words  on  the 
database  as  well,  and  the  average  rate  is  78%,  even  lower 
than  that  of  the  baseline  performance.  Finally,  some  visual 
comparisons  between  the  results  from  Sparse  PCA  and  SfM 
are  presented  in  Figure  4. 

7.  Conclusion  and  Discussion 

We  have  presented  a  novel  and  effective  solution  to  select 
informative  features  for  object  recognition  by  Sparse  PCA. 
For  applications  that  involve  low-quality  mobile  cameras  or 
surveillance  camera  networks,  existing  SfM  solutions  to  de¬ 
tect  and  suppress  uninformative  features  tend  to  fail.  We 
have  shown  that  Sparse  PCA  can  successfully  identify  im¬ 
portant  visual  features  that  explain  maximum  variability  in 
the  visual  histogram  vectors.  For  our  database,  these  fea¬ 
tures  correspond  to  those  visual  words  that  most  often  repre¬ 
sent  the  appearance  of  foreground  objects.  To  further  speed 
up  the  execution  of  Sparse  PCA,  we  have  developed  an 
improved  numerical  algorithm,  namely,  SPCA-ALM.  The 
new  algorithm  has  proved  significantly  faster  than  the  other 
convex semidefinite programming solutions. Using a public
multiple-view image database, our experiment shows the estimated
informative features improve the overall recognition rate by 5%
compared to the baseline solution, and by 7% compared to the SfM
solution.

For future work, we believe the two existing approaches, namely,
Sparse PCA and SfM, are complementary under more general object
recognition settings. We would like to focus on further combining
our batch numerical technique within a geometric RANSAC scheme to
robustly detect informative features in both low-quality and
high-quality image databases, which may lead to further improvement
of the performance.

Table 1. Recognition rates and number of selected informative
features for the 20 object classes in alphabetical order [17]. The
best rates are in bold face. The categories in which SfM failed
have zero features selected.

Cat.   Baseline   SPCA      SPCA     SfM       SfM
       Rate(%)    Rate(%)   # Feat   Rate(%)   # Feat
1      98.61      94.44     35       83.33     0
2      90.27      91.66     23       90.27     35
3      56.94      66.66     15       58.33     0
4      70.83      81.94     12       65.27     30
5      77.77      91.66     56       81.94     0
6      95.83      88.88     23       87.50     0
7      79.16      93.05     34       86.11     0
8      77.77      91.66     30       72.22     0
9      56.94      73.61     45       63.88     11
10     51.38      65.27     9        61.11     0
11     83.33      76.38     76       69.44     13
12     81.94      83.33     28       70.83     0
13     62.50      72.22     43       52.77     0
14     98.61      93.05     20       90.27     37
15     69.44      80.55     36       75.00     0
16     58.33      79.16     53       80.55     66
17     100.00     90.27     17       84.72     0
18     98.61      93.05     45       100.00    56
19     97.22      83.33     24       86.11     0
20     98.61      100       46       95.83     0
Avg.   80.02      84.51     33       77.77     12